In this project, we implement a movie recommendation system using the ratings data Netflix released for its Prize competition. We apply suitable data pre-processing, exploration, and visualization to help readers of this report follow the workflow of the project, which leads to the development of our models, complemented by error analysis and suggestions for improvement.
Firstly, we import the packages necessary for developing our project and deploying suitable models.
library(tidyverse)
library(ggplot2)
library(plotly)
library(glue)
library(scales)
library(recommenderlab)
library(dplyr)
library(recosystem)
library(tensorflow)
library(keras)
library(reticulate)

The following code chunk is run in Python. Its purpose is to combine all the txt data into one large file, perform data cleansing, and integrate the movie_id into the dataset. The lines that carry only a movie_id (the `movie_id:` header lines) are folded into the data rows instead of being kept as rows of missing values. The chunk is only executed if the combined file is not already present in the specified directory.
import os
from datetime import datetime

start = datetime.now()
if not os.path.isfile('./data_input/data.csv'):
    # read all txt files and store them in one big file
    data = open('./data_input/data.csv', mode='w')
    row = list()
    files = ['./data_input/combined_data_1.txt', './data_input/combined_data_2.txt',
             './data_input/combined_data_3.txt', './data_input/combined_data_4.txt']
    for file in files:
        print('reading ratings from {}...'.format(file))
        with open(file) as f:
            for line in f:
                del row[:]
                line = line.strip()
                if line.endswith(':'):
                    # a movie_id header line, e.g. "8:"
                    movie_id = line.replace(':', '')
                else:
                    # a rating line: customer_id,rating,date
                    row = [x for x in line.split(',')]
                    row.insert(0, movie_id)
                    data.write(','.join(row))
                    data.write('\n')
        print('Done.\n')
    data.close()
print('time taken:', datetime.now() - start)

## time taken: 0:00:00.019353
In this section, we re-read the data into our IDE (in our case, RStudio).
In this second section, we try to better understand the current state of our data and determine whether it needs further transformation due to underlying problems such as redundant data, out-of-range values, etc.
Here, we call the head function to get a sense of what our data looks like.
From the output above, we can see that the column names and some of the data types (e.g. date is not yet in Date format) are not suitable, so we rename the columns and convert the column types to appropriate ones before the data is used further.
data <- data %>% rename(movie_id = V1,
customer_id = V2,
rating = V3,
date = V4
) %>%
mutate(date = as.Date(date))
data %>% head()

Here, we check the number of movies in our data.
## [1] 17770
Here, we obtain the amount of unique customer ids in our data.
## [1] 480189
Here, we look at the dimensions of our data frame, also commonly known as its shape. The figures below indicate 100480507 rows and 4 columns.
## [1] 100480507 4
Here, we inspect the levels of the rating column by converting it to a factor in R. The result lists the unique rating values users have given. All ratings turn out to be as expected (no out-of-bound values), ranging from 1 to 5.
## [1] "1" "2" "3" "4" "5"
Below, we check whether any further processing is needed due to missing values in the columns. As the results show, our data frame has no missing values, since they were already handled in our Jupyter notebook before the data was imported here.
## movie_id customer_id rating date
## 0 0 0 0
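For readers following the cleaning steps in Python, the same column-wise missing-value check can be reproduced with pandas; the frame below uses made-up values mirroring our four columns:

```python
import pandas as pd

# Toy frame mirroring the four columns of the cleaned data (values made up).
df = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "customer_id": [10, 11, 12],
    "rating": [3, 5, 4],
    "date": pd.to_datetime(["2005-01-01", "2005-02-03", "2005-03-04"]),
})

# Count missing values per column, as colSums(is.na(data)) does in R.
print(df.isna().sum().tolist())  # [0, 0, 0, 0]
```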
With the following code, we create a new data frame for visualizing the distribution of the number of ratings across the movies in our dataset. We store the number of ratings given to a particular movie in the ‘total’ column, along with a tooltip column holding the text shown when a reader hovers over our graph.
data_movie <- data %>%
mutate(movie_id = as.factor(movie_id)) %>%
group_by(movie_id) %>%
count(rating)
data_movie <- data_movie %>%
summarise(total = sum(n))

## `summarise()` ungrouping output (override with `.groups` argument)
data_movie <- data_movie %>%
mutate(movie_id = fct_reorder(movie_id, desc(total))) %>%
mutate(tooltip = glue("Ratings given: {(total)}"))
data_movie %>% head()

Apart from that, we also create a new data frame for visualizing the user rating distribution, counting the total number of times each rating was given and storing it in a column n.
data_summary <- data %>%
group_by(rating) %>%
count(rating) %>%
mutate(rating = as.factor(rating))
data_summary

Below, we transform the newly created variable n into a probability distribution (the count for each rating divided by the total number of ratings given). We also define a new variable named tooltip with the glue function so that it can be attached to the bar graph later on.
label_percent <- label_dollar(suffix = '%' ,prefix = '')
data_summary$prob <- data_summary$n/sum(data_summary$n)*100
data_summary <- data_summary %>%
mutate(tooltip = glue("distribution: {label_percent(prob)}"))
data_summary

data_movie_graph <- ggplot(data_movie, aes(x = movie_id, y = total, text = tooltip)) +
geom_blank() +
theme(axis.line.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.x=element_blank(),
panel.grid.minor.x=element_blank(),
panel.grid.major.x=element_blank()) +
geom_col(position = "identity", fill="#f68060") +
labs(title = "Distribution of Ratings Given for the Movies",
x = "Movie_id",
y = "Amount of ratings")
ggplotly(data_movie_graph, tooltip = c("text"))

Below, we visualize the movie rating distribution using a bar graph with a color gradient for better aesthetics.
data_sum_graph <- ggplot(data_summary, aes(x = rating, y = prob, text = tooltip, fill = prob)) +
geom_col(position = "identity") +
labs(title = "Probability Distribution of Movie Ratings",
       subtitle = paste("For data set 1 (", n_distinct(data$movie_id), " movies, ", n_distinct(data$customer_id), " customers, and ", nrow(data), " ratings)"),
x = "Movie rating",
y = "Probability") +
scale_fill_gradient(low = "#e4333e", high = "#52171a") +
theme_minimal()
ggplotly(data_sum_graph, tooltip = c("text"))

In the code chunk below, we pick the top 10,000 users with the highest rating contributions to the data set. However, as the last 10 users are tied on the number of ratings given (shown with the tail function below), they are all included in our new data frame, bringing the total number of users to 10,009.
top_users <- data %>%
  group_by(customer_id) %>%
  count(rating)
top_users <- top_users %>%
  group_by(customer_id) %>%
  summarise(n = sum(n)) %>%
  top_n(10000)

## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by n
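The tie behaviour that pushes the user count from 10,000 to 10,009 can be illustrated in Python with pandas; the ids and counts below are invented:

```python
import pandas as pd

# Hypothetical per-customer rating counts (made-up ids and totals).
counts = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "n": [50, 40, 30, 30, 30, 10],
})

# Ask for the top 3 customers; keep='all' retains every row tied with the
# cutoff value, so the result can exceed 3 rows -- just as top_n keeps ties.
top = counts.nlargest(3, "n", keep="all")
print(len(top))  # 5: the three customers tied at 30 are all kept
```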
Additionally, we pick the top 2,000 most-rated movies in the data frame as the second table to be joined on matching values. We store the result in another data frame named top_movies.
top_movies <- data %>%
  group_by(movie_id) %>%
  count(rating)
top_movies <- top_movies %>%
  group_by(movie_id) %>%
  summarise(n = sum(n)) %>%
  top_n(2000)

## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by n
Below, we define a new variable named combined_light, whose purpose is to subset our previously enormous data frame into a smaller sample containing only the top 10,009 customers and 2,000 movies, by taking their intersection.
combined_light <- inner_join(data, top_users, by = "customer_id")
combined_light <- inner_join(combined_light, top_movies, by = "movie_id")

Here, we load the movie titles into a new data frame to be integrated later with our combined_light data frame. We then rename its columns to match the movie_id column name in combined_light, so that we can inner join the movie titles with combined_light and store the result back in combined_light.
movie_titles <- movie_titles %>% rename(movie_id = V1,
year_released = V2,
title = V3)
combined_light <- inner_join(combined_light, movie_titles %>% mutate_at(c("movie_id", "year_released"), as.numeric), by = "movie_id")

## Warning: Problem with `mutate()` input `movie_id`.
## ℹ NAs introduced by coercion
## ℹ Input `movie_id` is `.Primitive("as.double")(movie_id)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
## Warning: Problem with `mutate()` input `year_released`.
## ℹ NAs introduced by coercion
## ℹ Input `year_released` is `.Primitive("as.double")(year_released)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
In this section, we are going to build a neural network model using collaborative filtering technique with Keras neural network framework.
Before diving further into the model building, here we extract some insightful information that is required for the embedding layers’ parameters.
n_movies <- combined_light %>% select(movie_id) %>% distinct() %>% nrow()
n_users <- combined_light %>% select(customer_id) %>% distinct() %>% nrow()

Here, we define new dense indices for each movie and customer to aid us in the next section on model building (so that the customer ids and movie ids form a consecutive range instead of sparse, seemingly random values).
dense_movies <- combined_light %>% select(movie_id) %>% distinct() %>% rowid_to_column()
combined_light <- combined_light %>% inner_join(dense_movies) %>% rename(movie_id_dense = rowid)

## Joining, by = "movie_id"
dense_customer <- combined_light %>% select(customer_id) %>% distinct %>% rowid_to_column()
combined_light <- combined_light %>% inner_join(dense_customer) %>% rename(customer_id_dense = rowid)

## Joining, by = "customer_id"
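The same dense re-indexing can be mirrored in Python with pandas.factorize; the ids below are made up to resemble the sparse Netflix ids:

```python
import pandas as pd

# Toy ratings table; ids are arbitrary, mirroring the sparse Netflix ids.
df = pd.DataFrame({"movie_id": [8, 571, 8, 30, 571],
                   "customer_id": [1744889, 306466, 306466, 1744889, 999]})

# factorize assigns consecutive integer codes in order of first appearance;
# adding 1 makes the dense ids start at 1, as rowid_to_column does in R.
df["movie_id_dense"] = pd.factorize(df["movie_id"])[0] + 1
df["customer_id_dense"] = pd.factorize(df["customer_id"])[0] + 1
print(df["movie_id_dense"].tolist())     # [1, 2, 1, 3, 2]
print(df["customer_id_dense"].tolist())  # [1, 2, 2, 1, 3]
```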
After that, we separate the training data into two parts: x, holding the predictor variables, and y, holding the target variable.
set.seed(3359)
indices <- sample(nrow(combined_light), nrow(combined_light)*0.9)
#shuffle the order
combined_light_sample <- combined_light[indices,] %>% sample_frac()
validation <- combined_light[-indices,] %>% sample_frac()
x_train <- combined_light_sample %>% select(c(customer_id_dense, movie_id_dense)) %>% as.matrix()
y_train <- combined_light_sample %>% pull(rating)
x_train %>% head()

##      customer_id_dense movie_id_dense
## [1,] 9574 1457
## [2,] 749 652
## [3,] 9666 1880
## [4,] 4778 949
## [5,] 775 1686
## [6,] 887 1308
Check the dimension of our x_train matrix.
## [1] 6163201 2
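The 90/10 split above can be sketched in Python as follows (the row count is a stand-in for nrow(combined_light); the seed mirrors set.seed(3359), though the two RNGs do not produce the same draws):

```python
import numpy as np

rng = np.random.default_rng(3359)  # illustrative seed
n = 1000  # stand-in for the real number of rows

# Draw 90% of the row indices without replacement for training;
# the remaining rows become the hold-out validation set.
train_idx = rng.choice(n, size=int(n * 0.9), replace=False)
valid_idx = np.setdiff1d(np.arange(n), train_idx)

print(len(train_idx), len(valid_idx))  # 900 100
```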
In this section, we build our model directly with Keras. First, we need to choose an embedding dimension, which will be the output dimension for each predictor variable after embedding. Our model workflow will resemble the following picture:
First, we implement embedding layers for the input data, which have been transformed into single-column matrices. We thus have two layers acting as the inputs of our deep learning model. We also have to initialize the embedding dimension; this is a tunable parameter that can affect the outcome of model training. For this research, we set it to 32.
embedding_dim <- 32
# input layers
input_users <- layer_input(shape = 1, name = "users")
input_movies <- layer_input(shape = 1, name = "movies")
user_embeddings <- input_users %>%
layer_embedding(
input_dim = n_users+1,
output_dim = embedding_dim,
name = "user_embeddings"
)
movie_embeddings <- input_movies %>%
layer_embedding(
input_dim = n_movies+1,
output_dim = embedding_dim,
name = "movie_embeddings"
  )

Having finished setting up the inputs, we also have to set up the layers that turn the supplied information into a meaningful prediction. We take the dot product of the two embedding layers as the means of multiplying the embeddings together. We set axes = 2 because each embedding has shape (batch, 1, 32), so the product is taken over the embedding dimension. Finally, we produce the prediction with a dense layer and a relu activation, which prevents the value from going below 0, as the result is expected to be a rating between 0 and 5.
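Conceptually, the model scores a (user, movie) pair by taking the inner product of their two embedding rows; a toy numpy sketch (random weights, small dimensions, with w = 1 and b = 0 in the final dense step purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 4            # 32 in the actual model; kept small here
n_users, n_movies = 5, 7

# An embedding layer is a lookup table: one weight row per id.
user_emb = rng.normal(size=(n_users, embedding_dim))
movie_emb = rng.normal(size=(n_movies, embedding_dim))

def score(user_id, movie_id):
    """Inner product of the two embedding rows (the dot_product layer)."""
    return float(user_emb[user_id] @ movie_emb[movie_id])

# The dense layer with relu then maps the scalar to a non-negative value;
# w = 1 and b = 0 here are illustrative, the real weights are learned.
rating = max(0.0, score(2, 3))
```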
dot <- layer_dot(
inputs = list(user_embeddings, movie_embeddings),
axes = 2,
name = "dot_product"
)
pred <- dot %>% layer_dense(
units = 1,
activation = "relu",
name = "rating_prediction"
)

Finally, having implemented the layers with appropriate values and attributes, we are ready to assemble the complete deep learning model for recommending movies to a customer based on their previous viewing history. The metric we use is MAE (mean absolute error), which measures the level of accuracy attained by our model.
\[ MAE = \frac{\sum_{i = 1}^n{|y_i - x_i|}}{n} \]
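As a quick numerical illustration of the metric, with made-up ratings:

```python
import numpy as np

# MAE is the mean of absolute differences between predictions and targets.
y_true = np.array([4.0, 3.0, 5.0, 1.0])  # made-up true ratings
y_pred = np.array([3.5, 3.0, 4.0, 2.0])  # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.625
```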
model <- keras_model(inputs = c(input_users, input_movies), outputs = pred)
model %>% compile(
optimizer = "rmsprop",
loss = "mse",
metrics = "mae"
)
# inspect model
summary(model)

## Model: "model_3"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## users (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## movies (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## user_embeddings (Embeddin (None, 1, 32) 320320 users[0][0]
## ________________________________________________________________________________
## movie_embeddings (Embeddi (None, 1, 32) 64032 movies[0][0]
## ________________________________________________________________________________
## dot_product (Dot) (None, 1, 1) 0 user_embeddings[0][0]
## movie_embeddings[0][0]
## ________________________________________________________________________________
## rating_prediction (Dense) (None, 1, 1) 2 dot_product[0][0]
## ================================================================================
## Total params: 384,354
## Trainable params: 384,354
## Non-trainable params: 0
## ________________________________________________________________________________
In this section, we train our model using 0.9 of the total data. We also split off a further portion of the training data (0.1 of it) as a training-validation set for evaluating the model during training.
The following is the plot of our model training along with the training-validation results.
After that, we can further evaluate our model on the previously created validation data set, which holds unseen data drawn from the same distribution as the training data.
x_validation <- validation %>% select(c("customer_id_dense", "movie_id_dense")) %>% as.matrix()
y_validation <- validation %>% pull(rating) %>% as.matrix()

The following is the result on our validation dataset.
model %>% evaluate(list(x_validation[,"customer_id_dense", drop = FALSE], x_validation[,"movie_id_dense", drop = FALSE]), y_validation)

## loss mae
## 0.7068805 0.6452426
With the following code, we attach the movie titles to the embedding matrix learned by our model. We also show what the matrix looks like after this operation by subsetting it.
movie_embeddings <- model %>%
get_layer("movie_embeddings") %>%
get_weights() %>%
.[[1]]
movie_embeddings <- movie_embeddings[1:2000, ]

titles <- combined_light %>%
select(movie_id_dense, title) %>%
distinct() %>%
arrange(movie_id_dense) %>%
mutate(title = title %>% str_remove("\\(.+\\)") %>% str_trim())
dim(titles)

## [1] 2000 2
## [,1] [,2] [,3] [,4]
## What the #$*! Do We Know!? 0.04318949 -0.01487871 -0.03894392 -0.01927355
## 7 Seconds 0.41687262 0.55554485 -0.50574541 0.46071777
## Immortal Beloved 0.33558050 0.37755719 -0.45456252 -0.87704104
## Never Die Alone 1.03216648 1.14192116 -1.01735055 1.56010592
## Lilo and Stitch 0.33742619 0.20161268 -0.45891753 -0.66213036
## Something's Gotta Give 1.35063052 1.35863245 -0.94591588 -0.67308241
## Aqua Teen Hunger Force: Vol. 1 1.18078256 1.01571596 -1.14583123 -0.73814970
## Spitfire Grill 1.13011014 1.27259791 -1.17664480 1.30620992
## Rudolph the Red-Nosed Reindeer 0.86133909 1.11745095 -0.89220899 1.04572272
## The Weather Underground 0.95584351 1.21773434 -0.84877801 0.38804615
Here, we can see how the movies are inter-related from the movie_embeddings. Readers who wish to examine this graph more deeply are encouraged to zoom into its clusters and judge whether the movies within a given cluster are appropriately related.
n_words_to_plot <- 2000
tsne <- Rtsne::Rtsne(
X = movie_embeddings[1:n_words_to_plot,],
perplexity = 30,
pca = FALSE
)
p <- tsne$Y %>%
as.data.frame() %>%
mutate(word = row.names(movie_embeddings)[1:n_words_to_plot]) %>%
ggplot(aes(x = V1, y = V2, label = word)) +
geom_text(size = 3)
plotly::ggplotly(p)

We also experimented with a more conventional method of prediction: computing correlation values between two variables (in this case, movie titles) to see how closely they relate to each other. As mentioned in our proposal, we used the Pearson method to calculate the correlation coefficients. The code for this Pearson correlation section is implemented in Python.
## Creating the dataframe from combined_light.csv file..
df = pd.read_csv('./data_input/combined_light.csv', sep=',')
df.date = pd.to_datetime(df.date)
print('Done.\n')

## Done.
## Unnamed: 0 movie_id customer_id rating date n.x n.y
## 0 1 8 1744889 1 2005-08-25 2686 14910
## 1 2 8 1488844 4 2005-05-12 2206 14910
## 2 3 8 306466 4 2005-07-01 2299 14910
## 3 4 8 1987434 4 2005-10-04 1133 14910
## 4 5 8 573364 1 2005-11-22 1300 14910
## ... ... ... ... ... ... ... ...
## 9630919 9630920 17764 453585 3 2005-03-06 1906 64957
## 9630920 9630921 17764 1055714 5 2005-06-20 1425 64957
## 9630921 9630922 17764 2180413 4 2005-09-12 1466 64957
## 9630922 9630923 17764 477466 4 2005-10-17 1258 64957
## 9630923 9630924 17764 1804819 1 2005-11-03 1376 64957
##
## [9630924 rows x 7 columns]
df_title = pd.read_csv('./data_input/movie_titles.csv', sep=',', names=['movie_id','year','title'], encoding = 'latin')
df_title

## movie_id year title
## 0 1 2003.0 Dinosaur Planet
## 1 2 2004.0 Isle of Man TT 2004 Review
## 2 3 1997.0 Character
## 3 4 1994.0 Paula Abdul's Get Up & Dance
## 4 5 2004.0 The Rise and Fall of ECW
## ... ... ... ...
## 17765 17766 2002.0 Where the Wild Things Are and Other Maurice Se...
## 17766 17767 2004.0 Fidel Castro: American Experience
## 17767 17768 2000.0 Epoch
## 17768 17769 2003.0 The Company
## 17769 17770 2003.0 Alien Hunter
##
## [17770 rows x 3 columns]
## Unnamed: 0 movie_id ... year title
## 0 1 8 ... 2004.0 What the #$*! Do We Know!?
## 1 2 8 ... 2004.0 What the #$*! Do We Know!?
## 2 3 8 ... 2004.0 What the #$*! Do We Know!?
## 3 4 8 ... 2004.0 What the #$*! Do We Know!?
## 4 5 8 ... 2004.0 What the #$*! Do We Know!?
## ... ... ... ... ... ...
## 9630919 9630920 17764 ... 1998.0 Shakespeare in Love
## 9630920 9630921 17764 ... 1998.0 Shakespeare in Love
## 9630921 9630922 17764 ... 1998.0 Shakespeare in Love
## 9630922 9630923 17764 ... 1998.0 Shakespeare in Love
## 9630923 9630924 17764 ... 1998.0 Shakespeare in Love
##
## [9630924 rows x 9 columns]
In short, we processed our filtered dataset into a dataframe with customer_id as the rows, movie titles as the columns, and the ratings users gave as the values, as shown below.
## title 10 Things I Hate About You 12 Angry Men ... Zoolander sex
## customer_id ...
## 769 2.0 5.0 ... 1.0 2.0
## 1333 NaN 2.0 ... 2.0 3.0
## 2213 3.0 NaN ... 1.0 3.0
## 2455 2.0 4.0 ... NaN NaN
## 2787 NaN 4.0 ... 5.0 NaN
## ... ... ... ... ... ...
## 2648502 NaN 5.0 ... 3.0 NaN
## 2648589 3.0 NaN ... 4.0 4.0
## 2648734 NaN NaN ... NaN NaN
## 2648869 3.0 NaN ... 3.0 4.0
## 2648885 4.0 NaN ... NaN NaN
##
## [10009 rows x 1981 columns]
By transforming our dataset into this shape, it becomes easier to calculate the correlation between two movies (columns). Luckily for us, pandas has a .corrwith() method that carries out exactly this task: we pick a column and use .corrwith() to compute its correlation with every column of the dataframe. We then sort the result in descending order of correlation coefficient to find the movies most similar to the one we picked, which will be shown in the prediction section.
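A minimal sketch of this pivot-and-correlate approach, using an invented toy ratings table (titles A, B, C are hypothetical):

```python
import pandas as pd

# Toy ratings in the same long format as combined_light (made-up values).
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 3, 4, 4],
    "title": ["A", "B", "A", "B", "A", "B", "C", "B", "C"],
    "rating": [5, 4, 4, 4, 1, 2, 5, 3, 4],
})

# Rows = customers, columns = titles, values = ratings (NaN where unrated).
table = ratings.pivot_table(index="customer_id", columns="title", values="rating")

# Pearson correlation of every column against the chosen movie's column,
# sorted so the most similar titles come first.
similar = table.corrwith(table["A"]).sort_values(ascending=False)
print(similar.index[0])  # "A" -- a movie correlates perfectly with itself
```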
For this experiment, we selected a random customer for whom we will generate predictions based on his/her reviews of previously watched movies. With the code below, we identify the movies that the customer has not watched and that would be available for us to recommend.
new_customer_id <- 999
# get movies watched by our user
movies_watched <- combined_light %>%
filter(customer_id_dense == new_customer_id) %>%
pull(movie_id_dense)
# get all available movies
all_movies <- combined_light %>%
distinct(movie_id_dense) %>%
pull()
# identify movies not watched
movies_not_watched <- setdiff(all_movies, movies_watched)
movie_options <- combined_light %>%
filter(movie_id_dense %in% movies_not_watched) %>%
distinct(movie_id_dense, title)
movie_options

customer_options <- expand.grid(
user_id = new_customer_id,
movie_id_dense= movies_not_watched
) %>%
  as.matrix()

Here, we perform a prediction with our model on the inputs customer_id_dense (renamed user_id in this case) and movie_id_dense.
inputs <- list(
customer_options[, "user_id", drop = FALSE],
customer_options[, "movie_id_dense", drop = FALSE]
)
pred <- model %>% predict(inputs)
head(pred)

## , , 1
##
## [,1]
## [1,] 2.411670
## [2,] 3.326862
## [3,] 2.558096
## [4,] 2.849520
## [5,] 2.733004
## [6,] 3.581092
With the below code, we are trying to examine the movies that this particular customer has watched before along with the given ratings, to subsequently determine whether the given prediction is reasonable or not.
combined_light %>%
filter(movie_id_dense %in% movies_watched) %>%
filter(customer_id_dense == new_customer_id) %>%
distinct(title, movie_id_dense, rating) %>%
  arrange(desc(rating))

The following are the prediction results sorted by predicted rating in descending order. It is up to the reader to decide whether the predictions make sense.
customer_options %>%
as_tibble() %>%
mutate(predictions = as.vector(pred)) %>%
left_join(movie_options, by = "movie_id_dense") %>%
  arrange(desc(predictions))

For our interim prototype, the following prediction using Pearson correlation is currently limited to a single film as input. The code for this Pearson correlation section is implemented in Python.
Here, we try to predict the movies which the user would like to watch if he/she likes the movie 2 Fast 2 Furious.
## title
## 2 Fast 2 Furious 1.000000
## The Fast and the Furious 0.583671
## Torque 0.520689
## Cradle 2 the Grave 0.497942
## XXX: State of the Union 0.497156
## Speed 2: Cruise Control 0.483907
## Fire Down Below 0.473065
## Kangaroo Jack 0.469838
## XXX: Special Edition 0.469115
## Gone in 60 Seconds 0.461562
## dtype: float64
The same applies to the following chunks, each with its corresponding movie.
## title
## 2001: A Space Odyssey 1.000000
## Dr. Strangelove 0.448140
## A Clockwork Orange 0.432427
## Apocalypse Now 0.411397
## The Bicycle Thief 0.390656
## Taxi Driver 0.386643
## The Graduate 0.379015
## Apocalypse Now Redux 0.374817
## Blade Runner 0.374072
## Brazil 0.372438
## dtype: float64
## title
## Fight Club 1.000000
## Snatch 0.424055
## 12 Monkeys 0.414158
## Eternal Sunshine of the Spotless Mind 0.406362
## Memento 0.403274
## Pulp Fiction 0.395666
## Reservoir Dogs 0.382496
## Seven 0.378241
## American History X 0.374645
## American Beauty 0.371214
## dtype: float64
## title
## Pulp Fiction 1.000000
## Reservoir Dogs 0.527381
## Kill Bill: Vol. 1 0.414636
## Fight Club 0.395666
## American Beauty 0.394854
## Kill Bill: Vol. 2 0.391102
## The Big Lebowski 0.374954
## Memento 0.372970
## True Romance 0.371049
## Taxi Driver 0.367332
## dtype: float64